© Scott Robison 2021 all rights reserved.


Linear Models and Estimation by Least Squares

For Chapter 11: Linear Models and Estimation by Least Squares, we will cover the same material as the textbook, though likely not in the same order. I will list the page range for the whole of Chapter 11 but will go at my own pace and in my own order, so please reference the textbook at your convenience.

Chapter 11 pages 563-609 from the text.

Simple Linear Regression: Evaluating the Quality of the Linear Relationship

We began our univariate study of numeric variables by understanding the “standardized” deviations from the mean. Recall:

\[Z=\frac{Y-\mu_Y}{\sigma_Y}\]

The “non-standardized” deviation this concept is based on is \(Y-\mu_Y\); when we only have sample data, we settle for \(Y-\overline{Y}\). Summing the squared deviations over a sample, \(\sum_{i=1}^n(Y_i-\overline{Y})^2\), gives what is called the total sum of squares (SST).

Think of it this way: when considering random variables \(Y_i\), we have seen that the MLE of the mean for normal (or “large”) random samples is \(\overline{Y}\). Now, when we discover that the \(Y_i\)’s are linearly dependent on the \(X_i\)’s (\(Y_i \mid X_i\)), a better estimator than simply \(\overline{Y}\) is \(\widehat{Y}_i\), the simple linear regression (SLR) estimator.

We have developed the least-squares estimate \(\widehat{Y}_i=\widehat{\beta}_0 +\widehat{\beta}_1 X_i\) from the sum of squared residual errors, \(SSE=\sum_{i=1}^n\varepsilon_i^2 =\sum_{i=1}^n(Y_i-\widehat{Y} _i)^2 =\sum_{i=1}^n(Y_i- (\widehat{\beta}_0 +\widehat{\beta}_1 X_i))^2 .\)

We now have two estimators, \(\overline{Y}\) and \(\widehat{Y}_i\), and their two respective errors (SST and SSE). By comparing the “size” of these two errors we can get an idea of how much more of the error in \(Y\) we were able to “explain” when we considered the linear dependence conditioned on \(X\).

\(\sum_{i=1}^n(Y_i-\overline{Y})^2 =\sum_{i=1}^n(Y_i-\widehat{Y}_i)^2 +\text{sum of squared error “explained” by the regression (SSR)}\)

\[SST=SSE+SSR\]

Total sum of squares = sum of squared error remaining after the regression + sum of squared error explained by the linear model.
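This decomposition is easy to verify numerically. As a sketch (using R’s built-in `cars` data set, not one of the course files), fit a line and compute all three sums of squares directly:

```r
# Verify SST = SSE + SSR on R's built-in cars data (speed vs. stopping distance)
Y <- cars$dist
X <- cars$speed
fit <- lm(Y ~ X)

SST <- sum((Y - mean(Y))^2)            # total sum of squares
SSE <- sum((Y - fitted(fit))^2)        # unexplained (residual) error
SSR <- sum((fitted(fit) - mean(Y))^2)  # error explained by the regression

all.equal(SST, SSE + SSR)              # TRUE (up to floating point)
```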

By looking at the graph we can see how to express SSR

\[\begin{align} SSR &=\sum_{i=1}^n(\widehat{Y}_i-\overline{Y})^2 \\ SST &=\sum_{i=1}^n(Y_i-\overline{Y} )^2\\ &=\sum_{i=1}^n(Y_i-\widehat{Y}_i)^2 +\sum_{i=1}^n(\widehat{Y}_i-\overline{Y})^2 \end{align}\]

Coefficient of Determination, \(r^2\)

A very popular contextual interpretation of how well a least-squares estimator performs is to consider the ratio of the explained variation (conditioned on the predictor variable \(X\): \(SSR\)) to the total (original, univariate) variation (\(SST\)).

Coefficient of Determination \(=r^2=\frac{SSR}{SST}\)

Why use the familiar symbol \(r\) (correlation)?
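We will justify this in class; as an empirical sketch (again using R’s built-in `cars` data rather than a course file), note that for simple linear regression the ratio \(SSR/SST\) comes out equal to the squared sample correlation:

```r
# For SLR, r^2 = SSR/SST equals the squared sample correlation cor(X, Y)^2
Y <- cars$dist
X <- cars$speed
fit <- lm(Y ~ X)

SST <- sum((Y - mean(Y))^2)
SSR <- sum((fitted(fit) - mean(Y))^2)

SSR / SST     # coefficient of determination
cor(X, Y)^2   # squared correlation -- the same number
```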

Other expressions of \(SSR,SST ,SSE\) etc.

Example 1

The American Automobile Association has published data (Defensive Driving: Managing Time and Space, 1991) that looks at the relationship between the average stopping distance ( \(y =\) distance, in feet) and the speed of a car (\(x =\) speed, in miles per hour). The data set carstopping.csv contains 8 such data points. (shape)

X=carstopping$Speed
Y=carstopping$Distance

cov(X,Y)
## [1] 3403.571
cor(X,Y)
## [1] 0.9749928
cor(X,Y)^2
## [1] 0.950611
fit=lm(Y~X)
summary(fit)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.738 -22.351  -7.738  16.622  47.083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -44.1667    22.0821   -2.00   0.0924 .  
## X             5.6726     0.5279   10.75 3.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.21 on 6 degrees of freedom
## Multiple R-squared:  0.9506, Adjusted R-squared:  0.9424 
## F-statistic: 115.5 on 1 and 6 DF,  p-value: 3.837e-05
fit$residuals
##          1          2          3          4          5          6          7 
##  44.166667   7.440476 -19.285714 -31.011905 -32.738095 -19.464286   3.809524 
##          8 
##  47.083333
SSE=sum(fit$residuals^2)
SSR=sum((fit$fitted.values-mean(Y))^2)
SST=SSE+SSR                # via the decomposition SST = SSE + SSR
SST2=var(Y)*(length(Y)-1)  # computed directly; matches SST


anova(fit)
plot(Y~X)
abline(fit,col="blue")

Solution

Example 2

Is there a relationship between tobacco use and alcohol use? The British government regularly conducts surveys on household spending. One such survey (Family Expenditure Survey, Department of Employment, 1981) determined the average weekly expenditure on tobacco (\(x\), in British pounds) and the average weekly expenditure on alcohol (\(y\), in British pounds) for households in \(n = 11\) different regions in the United Kingdom. The fitted line plot of the resulting data alcoholtobacco.csv: (outlier)

X=alcoholtobacco$Tobacco
Y=alcoholtobacco$Alcohol


cor(X,Y)
## [1] 0.2235721
cor(X,Y)^2
## [1] 0.04998449
fit=lm(Y~X)
summary(fit)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7080 -0.4245  0.2311  0.6081  0.9020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   4.3512     1.6067   2.708   0.0241 *
## X             0.3019     0.4388   0.688   0.5087  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8196 on 9 degrees of freedom
## Multiple R-squared:  0.04998,    Adjusted R-squared:  -0.05557 
## F-statistic: 0.4735 on 1 and 9 DF,  p-value: 0.5087
anova(fit)
plot(Y~X)

abline(fit,col="blue")
points(x = X[11],y = Y[11],col="red")

alcoholtobacco1=alcoholtobacco[-11,]
alcoholtobacco
alcoholtobacco1
X1=alcoholtobacco1$Tobacco
Y1=alcoholtobacco1$Alcohol

cor(X1,Y1)
## [1] 0.7842873
cor(X1,Y1)^2
## [1] 0.6151066
fit1=lm(Y1~X1)
summary(fit1)
## 
## Call:
## lm(formula = Y1 ~ X1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51092 -0.42434  0.06056  0.34406  0.62991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   2.0412     1.0014   2.038  0.07586 . 
## X1            1.0059     0.2813   3.576  0.00723 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.446 on 8 degrees of freedom
## Multiple R-squared:  0.6151, Adjusted R-squared:  0.567 
## F-statistic: 12.78 on 1 and 8 DF,  p-value: 0.007234
anova(fit1)
plot(Y~X)

abline(fit,col="blue")
points(x = X[11],y = Y[11],col="red")
abline(fit1,col="red")

Solution

Least-Squares conditions/assumptions

The linear model estimated through least squares requires two conditions/assumptions:

  1. The response variable \(Y\) is Normally distributed regardless of the value of the predictor variable \(X_i\).

  2. The variation in the response variable Y is the same regardless of the value of the predictor variable \(X_i\). This concept of ‘common variation/standard deviation’ is called homoscedasticity. This common variance in the response variable \(Y\) is represented by \(\sigma^2\).

Residual plots and normal Q-Q plots help us evaluate these conditions and can reveal other useful traits to us!
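As a sketch of what “well-behaved” data looks like, we can simulate responses that satisfy both conditions (normal errors with a common \(\sigma\), here assumed to be 2) and inspect the standard diagnostic plots:

```r
# Simulate data satisfying both conditions: Y = 2 + 3X + e, e ~ N(0, sigma^2)
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3*x + rnorm(50, mean = 0, sd = 2)  # common sd = 2 (homoscedastic)

fit.sim <- lm(y ~ x)
plot(fit.sim)  # residual and Q-Q plots should show no pattern
```

The residuals-vs-fitted plot should look like a patternless horizontal band, and the Q-Q points should fall near the line; departures from either suggest one of the two conditions fails.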

Revisit examples:
X=carstopping$Speed
Y=carstopping$Distance

fit=lm(Y~X)

plot(Y~X)
abline(fit,col="red")

plot(fit)

X=alcoholtobacco$Tobacco
Y=alcoholtobacco$Alcohol

fit=lm(Y~X)

plot(Y~X)

curve(fit$coefficients[1]+fit$coefficients[2]*x,add = T,col="blue")

plot(fit)

alcoholtobacco1=alcoholtobacco[-11,]
X1=alcoholtobacco1$Tobacco
Y1=alcoholtobacco1$Alcohol

fit1=lm(Y1~X1)
plot(fit1)

alcoholtobacco2=alcoholtobacco1[-10,]
X2=alcoholtobacco2$Tobacco
Y2=alcoholtobacco2$Alcohol

fit2=lm(Y2~X2)
plot(fit2)

summary(fit2)
## 
## Call:
## lm(formula = Y2 ~ X2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5379 -0.3256  0.0855  0.1660  0.6562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.7974     1.1427   0.698  0.50778   
## X2            1.3864     0.3324   4.171  0.00418 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.398 on 7 degrees of freedom
## Multiple R-squared:  0.7131, Adjusted R-squared:  0.6721 
## F-statistic:  17.4 on 1 and 7 DF,  p-value: 0.004185

Solution

Estimating \(S_{\widehat{Y}}=S_e=S_{SSE}\)

How do we estimate \(\sigma^2\) from the bivariate data collected? We return to the sum of squared errors, \(SSE\), and consider its expected value. Remember that \(\widehat{Y}\) is conditioned on having the \(X_i\)’s (meaning the \(X_i\)’s can be treated as constants, not random variables). But first we will need some preliminary results.

\(E[\widehat{\beta}_1 ]=\)

\(V[\widehat{\beta}_1 ]=\)

\(E[\widehat{\beta}_0 ]=\)

\(V[\widehat{\beta}_0 ]=\)
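These expectations will be derived in class. As an empirical sketch of the first one, a quick Monte Carlo simulation (with an assumed true slope \(\beta_1 = 3\) and the \(X_i\)’s held fixed) suggests the least-squares slope is unbiased:

```r
# Monte Carlo check that E[beta1.hat] = beta1 (here beta1 = 3), X's fixed
set.seed(42)
x <- 1:20  # X's treated as constants across repeated samples
b1.hat <- replicate(2000, {
  y <- 1 + 3*x + rnorm(20, sd = 2)  # fresh errors each sample
  coef(lm(y ~ x))[2]
})
mean(b1.hat)  # close to the true slope, 3
```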

\(\widehat{\sigma}^2 =S_e^2=\frac{SSE}{n-2}\overset{def}{=} MSE\), and this estimator is unbiased: \(E[S_e^2 ]=\sigma^2\)
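This \(S_e=\sqrt{SSE/(n-2)}\) is exactly the “Residual standard error” that `summary()` reports. A sketch of the computation by hand (using R’s built-in `cars` data set rather than a course file):

```r
# S_e = sqrt(SSE/(n-2)) is the "Residual standard error" in summary(fit)
Y <- cars$dist
X <- cars$speed
fit <- lm(Y ~ X)

n   <- length(Y)
SSE <- sum(residuals(fit)^2)
Se  <- sqrt(SSE / (n - 2))  # our estimate of sigma

Se
sigma(fit)                  # R's built-in extractor -- same value
```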

Let’s return to our example of ten students’ heights and weights, and try to interpret the quality of our simple linear model…

X <- c(63,64,66,69,69,71,71,72,73,75);
Y <- c(127,121,142,157,162,156,169,165,181,208);
Student=1:10;


reg1=data.frame("Student ID"=Student,height=X,weight=Y)
fit=lm(weight~height,data = reg1)
summary(fit)
## 
## Call:
## lm(formula = weight ~ height, data = reg1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2339  -4.0804  -0.0963   4.6445  14.2158 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -266.5344    51.0320  -5.223    8e-04 ***
## height         6.1376     0.7353   8.347 3.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.641 on 8 degrees of freedom
## Multiple R-squared:  0.897,  Adjusted R-squared:  0.8841 
## F-statistic: 69.67 on 1 and 8 DF,  p-value: 3.214e-05


Solution